feat: deterministic replay for recorded RLM runs #21

Merged — errantsky merged 2 commits into main from feat/deterministic-replay, Mar 6, 2026
Conversation

@errantsky
Owner

Summary

Adds the ability to record LLM responses during a run and replay them later without making live API calls. This enables:

  • Debugging — replay a failed run, patch one iteration's code to test a fix
  • Regression testing — replay a successful run against a new codebase version
  • Model comparison — replay with a different model via the :live fallback
  • Cost optimization — re-execute eval steps without any LLM calls

How it works

Recording

When enable_replay_recording: true is set, the Worker emits a [:rlm, :llm, :response, :recorded] telemetry event after each successful LLM call. The EventLogHandler persists the full response text and usage metadata as :llm_response events in both the in-memory Agent and :dets TraceStore.

The original context and query are also stored in the :node_start event for depth-0 workers, so replay can recover the inputs.
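The recording path can be sketched as follows. The config flag and the telemetry event name come from this PR; the handler body, `metadata` fields, and `TraceStore.append/2` call are assumptions shown only to illustrate the flow:

```elixir
# Opt in to recording (field name from this PR; default is false).
config = [enable_replay_recording: true]

# The EventLogHandler attaches to the event roughly like this.
# The metadata keys and TraceStore.append/2 are illustrative assumptions.
:telemetry.attach(
  "replay-recorder",
  [:rlm, :llm, :response, :recorded],
  fn _event, _measurements, metadata, _handler_config ->
    # Persist the full response text and usage as an :llm_response event
    # in both the in-memory Agent and the :dets TraceStore.
    TraceStore.append(metadata.run_id, {:llm_response, metadata.response, metadata.usage})
  end,
  nil
)
```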

Replay

RLM.replay(run_id) builds a Tape (ordered list of recorded responses) from the EventLog, then starts a new Worker that uses RLM.Replay.LLM — a process-dict-based LLM behaviour implementation that returns responses from the tape instead of calling the API. All eval'd code is re-executed normally.
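A minimal usage sketch of the flow above. `RLM.replay/2` and `Tape.from_events/1` are named in this PR; `EventLog.events/1`, the tape internals, and the `{:ok, result}` return shape are assumptions:

```elixir
run_id = "some-recorded-run"  # hypothetical id

# Internally, replay builds a Tape from the recorded events...
tape = RLM.Replay.Tape.from_events(EventLog.events(run_id))

# ...then starts a Worker whose LLM module is RLM.Replay.LLM,
# so each "LLM call" pops the next recorded response off the tape
# while all eval'd code runs for real.
{:ok, result} = RLM.replay(run_id)
```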

Patching

RLM.replay(run_id, patch: %{0 => "new_code"}) replaces the code at iteration 0 before eval. The tape entry is still consumed to maintain iteration alignment.
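In sketch form (the `patch:` option is from this PR; the Worker-internal variable names and the example code string are assumptions):

```elixir
run_id = "some-recorded-run"  # hypothetical id

# Substitute iteration 0's code; its tape entry is still consumed,
# so iterations 1..n stay aligned with their recorded responses.
{:ok, result} = RLM.replay(run_id, patch: %{0 => "IO.inspect(state)"})

# Inside the Worker, patch application before eval likely reduces to
# a lookup with the recorded code as the default (names assumed):
code_to_eval = Map.get(state.replay_patches, iteration, recorded_code)
```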

Fallback

RLM.replay(run_id, fallback: :live, config: [llm_module: RLM.LLM]) uses RLM.Replay.FallbackLLM, which consumes tape entries first and switches to live LLM calls when exhausted. This handles the case where a patch causes extra iterations beyond the recorded tape length.
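The tape-then-live strategy might look roughly like this. The module and option names come from this PR, but the `chat/4` argument order, `pop_tape_entry/0`, and return shapes are assumptions:

```elixir
defmodule RLM.Replay.FallbackLLM do
  # Sketch: serve recorded responses while the tape lasts,
  # then delegate to a configurable live LLM module.
  def chat(messages, tools, opts, state) do
    case pop_tape_entry() do
      {:ok, response} ->
        {:ok, response}

      :empty ->
        # llm_module comes from the :config option, e.g.
        # RLM.replay(run_id, fallback: :live, config: [llm_module: RLM.LLM])
        live = Keyword.get(opts, :llm_module, RLM.LLM)
        live.chat(messages, tools, opts, state)
    end
  end
end
```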

New modules

| Module | Purpose |
| --- | --- |
| `RLM.Replay` | Orchestrator: `replay/2` with patch/fallback/config support |
| `RLM.Replay.Tape` | Struct + `from_events/1` builder from EventLog/TraceStore |
| `RLM.Replay.LLM` | LLM behaviour — returns tape responses via process dict |
| `RLM.Replay.FallbackLLM` | LLM behaviour — tape first, then live fallback |

Modified modules

| Module | Change |
| --- | --- |
| `RLM.Config` | Added `enable_replay_recording` field (default: `false`) |
| `RLM.Worker` | `replay_patches` struct field, tape loading in init, patch application before eval, LLM response recording telemetry |
| `RLM.Telemetry` | Registered `[:rlm, :llm, :response, :recorded]` event |
| `RLM.Telemetry.EventLogHandler` | Handler for `:llm_response` events + `original_context`/`original_query` in `:node_start` |
| `RLM` | `replay/2` public API, updated boundary exports |

Design decisions

  • Process dict for tape state — The RLM.LLM behaviour's chat/4 doesn't have a replay-state argument. Rather than changing the behaviour (breaking all implementations), the tape lives in the Worker's process dict, matching RLM.Eval's existing pattern.
  • Code-level patches, not response-level — Patches replace the code that gets eval'd, not the LLM response. The tape entry is still consumed to maintain iteration alignment. This is the most useful granularity for debugging.
  • Recording is opt-in — Full LLM responses can be large, so enable_replay_recording defaults to false.
  • Subcall replay deferred — This replays root worker iterations only. Subcall replay (child workers with their own tapes) is a natural extension but adds significant complexity.
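The process-dict approach described above can be sketched like this. The dict key and entry shapes are assumptions; only the general pattern (mirroring `RLM.Eval`) is from the PR:

```elixir
# Seed the tape into the Worker's process dictionary before the run.
# :rlm_replay_tape is a hypothetical key chosen for illustration.
Process.put(:rlm_replay_tape, tape.responses)

# Each replayed LLM call then pops the next entry without any
# change to the chat/4 behaviour signature:
defp pop_tape_entry do
  case Process.get(:rlm_replay_tape, []) do
    [] ->
      :empty

    [next | rest] ->
      Process.put(:rlm_replay_tape, rest)
      {:ok, next}
  end
end
```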

Test plan

  • Recording: llm_response events stored when flag enabled, skipped when disabled
  • Recording: multi-iteration responses captured in order
  • Recording: original_context/original_query stored in node_start events
  • Tape: builds from EventLog, falls back to TraceStore
  • Tape: error cases (no events, no responses)
  • Replay LLM: returns entries in order, errors when exhausted
  • Replay: produces same result as original run
  • Replay: multi-iteration runs
  • Replay: patch substitutes code at specific iterations
  • Replay: error for nonexistent runs
  • Fallback: :live falls back to real LLM when tape exhausted
  • Fallback: tape entries consumed before fallback kicks in
  • Fallback: :error (default) returns error when exhausted
  • Public API: RLM.replay/2 delegates correctly
  • End-to-end smoke test (record → replay → patch → fallback)
  • Full suite: 162 tests pass, 0 failures
  • mix compile --warnings-as-errors clean
  • mix format --check-formatted clean
  • mix docs — no new warnings

🤖 Generated with Claude Code

errantsky and others added 2 commits March 5, 2026 17:13
Enable replaying previously recorded runs without making live LLM calls.
Recorded LLM responses are stored as trace events and consumed in order
during replay, re-executing all eval'd code deterministically.

New modules:
- RLM.Replay — orchestrator with patch support for code substitution
- RLM.Replay.Tape — builds ordered response sequences from EventLog
- RLM.Replay.LLM — LLM behaviour impl using process-dict tape state

Recording infrastructure:
- enable_replay_recording config flag (default: false)
- [:rlm, :llm, :response, :recorded] telemetry event
- original_context/query stored in node_start events (depth-0)
- replay_patches field on Worker struct for code patching

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a replay patch causes extra iterations beyond what the tape
recorded, the :live fallback switches to a real LLM module instead
of returning an error. The fallback module is configurable via the
:config option's llm_module key.

New module: RLM.Replay.FallbackLLM — tries tape first, delegates
to a live LLM module when entries are exhausted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@errantsky errantsky merged commit c111015 into main Mar 6, 2026
1 check passed
@errantsky errantsky deleted the feat/deterministic-replay branch March 6, 2026 04:59
